
    Using Ontologies for the Design of Data Warehouses

    Obtaining an implementation of a data warehouse is a complex task that forces designers to acquire wide knowledge of the domain, thus requiring a high level of expertise and making it a prone-to-fail task. Based on our experience, we have identified a set of situations encountered in real-world projects in which we believe that the use of ontologies would improve several aspects of the design of data warehouses. The aim of this article is to describe several shortcomings of current data warehouse design approaches and to discuss the benefits of using ontologies to overcome them. This work is a starting point for discussing the suitability of using ontologies in data warehouse design. (Comment: 15 pages, 2 figures)

    A family of experiments to validate measures for UML activity diagrams of ETL processes in data warehouses

    In data warehousing, Extract, Transform, and Load (ETL) processes are in charge of extracting the data from the data sources that will be contained in the data warehouse. Their design and maintenance is thus a cornerstone of any data warehouse development project. Due to their relevance, the quality of these processes should be formally assessed early in the development in order to avoid populating the data warehouse with incorrect data. To this end, this paper presents a set of measures with which to evaluate the structural complexity of ETL process models at the conceptual level. The study is, moreover, accompanied by the application of formal frameworks and by a family of experiments whose aims are to theoretically and empirically validate the proposed measures, respectively. Our experiments show that the use of these measures can aid designers in predicting the effort associated with ETL process maintenance tasks and in making ETL process models more usable. Our work is based on Unified Modeling Language (UML) activity diagrams for modeling ETL processes, and on the Framework for the Modeling and Evaluation of Software Processes (FMESP) for the definition and validation of the measures.
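
    The abstract does not spell out the individual measures, but the general idea of counting structural elements of an activity diagram can be sketched as follows. This is a minimal Python illustration, assuming a toy graph representation and invented measure names (NA, NT, fan-in/fan-out); it is not the paper's FMESP-based measure suite.

        # Structural complexity counts over an ETL process model represented
        # as a directed graph of activities (illustrative measures only).
        from collections import defaultdict

        def structural_measures(edges):
            """edges: list of (source_activity, target_activity) pairs taken
            from a UML activity diagram of the ETL process."""
            activities = set()
            fan_in, fan_out = defaultdict(int), defaultdict(int)
            for src, dst in edges:
                activities.update((src, dst))
                fan_out[src] += 1
                fan_in[dst] += 1
            return {
                "NA": len(activities),                        # number of activities
                "NT": len(edges),                             # number of transitions
                "MaxFanIn": max(fan_in.values(), default=0),
                "MaxFanOut": max(fan_out.values(), default=0),
            }

        # Hypothetical ETL fragment: Extract -> Clean -> Conform -> Load
        example = [("Extract", "Clean"), ("Clean", "Conform"),
                   ("Clean", "LogErrors"), ("Conform", "Load")]
        print(structural_measures(example))

    Higher counts would be read as higher structural complexity, which the experiments relate to maintenance effort.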

    Word embeddings for retrieving tabular data from research publications

    Scientists face challenges when finding datasets related to their research problems due to the limitations of current dataset search engines. Existing tools for searching research datasets rely on publication content or metadata, without considering the data contained in the publication in the form of tables. Moreover, scientists require more elaborate inputs and functionalities to retrieve different parts of an article, such as data presented in tables, based on their search purposes. Therefore, this paper proposes a novel approach to retrieve relevant tabular datasets from publications. The input of our system is a research problem stated as an abstract from a scientific paper, and the output is a set of relevant tables from publications that are related to the research problem. This approach aims to provide a better solution for scientists to find useful datasets that support them in addressing their research problems. To validate this approach, experiments were conducted using word embeddings from different language models to calculate the semantic similarity between abstracts and tables. The results showed that contextual models significantly outperformed non-contextual models, especially when pre-trained with scientific data. Furthermore, context was found to be crucial for improving the results. Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work is part of the project TED2021-130890B-C21, funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR. Alberto Berenguer has a contract for predoctoral training with the Generalitat Valenciana and the European Social Fund, funded by the grant ACIF/2021/507.
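
    The retrieval step can be illustrated with a contextual embedding model: embed the abstract and the textual rendering of each candidate table, then rank tables by cosine similarity. The model name and the toy data below are assumptions for illustration; the paper compares several contextual and non-contextual models rather than prescribing this one.

        from sentence_transformers import SentenceTransformer, util

        model = SentenceTransformer("all-MiniLM-L6-v2")  # any contextual encoder

        abstract = ("We study how minimum temperature relates to the daily "
                    "incidence of COVID-19 across Spanish provinces.")

        # A table can be rendered as its caption plus flattened header/cell text.
        tables = [
            "Table 2: Daily confirmed cases and minimum temperature per province",
            "Table 1: Hyperparameters used to train the language models",
            "Table 3: Precipitation and wind speed by weather station",
        ]

        abstract_emb = model.encode(abstract, convert_to_tensor=True)
        table_embs = model.encode(tables, convert_to_tensor=True)
        scores = util.cos_sim(abstract_emb, table_embs)[0]

        # Rank candidate tables by similarity to the research-problem abstract.
        for score, table in sorted(zip(scores.tolist(), tables), reverse=True):
            print(f"{score:.3f}  {table}")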

    Open business intelligence: on the importance of data quality awareness in user-friendly data mining

    Citizens demand more and more data for making decisions in their daily lives. Therefore, mechanisms that allow citizens to understand and analyze linked open data (LOD) in a user-friendly manner are highly required. To this aim, the concept of Open Business Intelligence (OpenBI) is introduced in this position paper. OpenBI enables non-expert users to (i) analyze and visualize LOD, thus generating actionable information by means of reporting, OLAP analysis, dashboards or data mining; and to (ii) share the newly acquired information as LOD to be reused by anyone. One of the most challenging issues of OpenBI is related to data mining, since non-experts (such as citizens) need guidance during the preprocessing and application of mining algorithms due to the complexity of the mining process and the low quality of the data sources. This is even worse when dealing with LOD, not only because of the different kinds of links among data, but also because of its high dimensionality. As a consequence, in this position paper we advocate that data mining for OpenBI requires data quality-aware mechanisms for guiding non-expert users in obtaining and sharing the most reliable knowledge from the available LOD.

    Business Intelligence and the Web. Guest editors' introduction

    The special issue of Information Systems Frontiers on Business Intelligence and the Web discusses the increasing use of Business Intelligence (BI) solutions that allow enterprises to query, understand, and analyze business data in order to make better decisions. The special issue targets the issues of Web data feeding BI and of engineering Web-enabled BI. It notes that the amount and complexity of data available on the Web have been growing rapidly, and that designers of BI applications making use of data from the Web have to deal with several challenges arising from these developments. The importance of Web opinion feeds as information sources for companies is highlighted in the article entitled 'Storing and Analyzing Voice of the Market Data in the Corporate Data Warehouse'. The authors propose a technique to integrate opinion data in BI models. Specifically, they present a multidimensional data model that integrates sentiment data extracted from customer opinion forums into the corporate data warehouse.

    Requirements Engineering in the Development Process of Web Systems: A Systematic Literature Review

    Requirements Engineering (RE) is the first phase in the software development process, during which designers attempt to fully satisfy users’ needs. Web Engineering (WE) methods should consider adapting RE to the Web’s large and diverse user groups. The objective of this work is to classify the literature with regard to the RE applied in WE in order to obtain the current “state of the art”. The present work is based on the Systematic Literature Review (SLR) method proposed by Kitchenham; we have reviewed publications from ACM, IEEE, Science Direct, DBLP and the World Wide Web. From a population of 3059 papers, we identified 14 primary studies, which provide information concerning RE when used in WE methods. This work has been partially supported by the Programa de Fomento y Apoyo a Proyectos de Investigación (PROFAPI) from the Universidad Autónoma de Sinaloa (México), the MANTRA project (GRE09-17) from the University of Alicante, Spain, and GV/2011/035 from the Valencia Government.

    Usefulness of open data to determine the incidence of COVID-19 and its relationship with atmospheric variables in Spain during the 2020 lockdown

    The SARS-CoV-2 pandemic and the spread of the COVID-19 disease led to a lockdown being imposed in Spain to minimise contagion from 16 March 2020 to 1 May 2020. Over this period, measures were taken to reduce population mobility (a key factor in disease transmission). The scenario thus created enabled us to examine the impact of factors other than mobility (in this case, meteorological conditions) on the incidence of the disease, and thus to identify which environmental variables played the biggest role in the pandemic's evolution. Worthy of note, the data required to perform the study were entirely extracted from governmental open data sources. The present work therefore demonstrates the utility of such data to conduct scientific research of interest to society, leading to studies that are also fully reproducible. The results revealed a relationship between temperature and the spread of COVID-19. The trend was that of a slightly lower disease incidence as the minimum temperature rises, i.e. the lower the minimum temperature, the greater the number of cases. Furthermore, a link was found between the incidence of the disease and other variables, such as altitude and proximity to the sea. There were no indications in the study's data, however, of a relationship between incidence and precipitation or wind. This work is funded by the GVA-COVID19/2021/103 project from the “Conselleria de Innovación, Universidades, Ciencia y Sociedad Digital de la Generalitat Valenciana”.
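
    The kind of analysis described can be sketched with open data in a few lines of Python. The file name and column names below are assumptions for illustration, not the study's actual sources or code; the idea is simply to correlate daily incidence with minimum temperature per province.

        import pandas as pd
        from scipy.stats import spearmanr

        # Hypothetical CSV built from governmental open data:
        # one row per province and day, with columns date, province, cases, tmin.
        df = pd.read_csv("incidence_weather_lockdown.csv", parse_dates=["date"])

        results = []
        for province, grp in df.groupby("province"):
            rho, p = spearmanr(grp["tmin"], grp["cases"])
            results.append({"province": province, "spearman_rho": rho, "p_value": p})

        summary = pd.DataFrame(results).sort_values("spearman_rho")
        # Negative rho means lower minimum temperatures go with more cases.
        print(summary.head(10))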

    Framework for Prioritization of Open Data Publication: An Application to Smart Cities

    Public Sector Information is considered to play a fundamental role in the growth of the knowledge economy and in improvements to society. Given the difficulty in publishing and maintaining all available data, due to budget constraints, institutions need to select which data to publish, giving priority to the data most likely to generate social and economic impact. Prioritizing publication could become an even more significant problem in Smart Cities: as huge amounts of information are generated from different domains, the way data is prioritized, and thus reused, could be a determining factor in promoting, among other things, new and sustainable business opportunities for local entrepreneurs, and in improving citizens' quality of life. However, the people in charge of prioritizing which data to publish through open data portals (such as Chief Data Officers, or CDOs) do not have any specific support available for their decision-making process. In this work, a framework for the prioritization of open data publication, as well as its application to Smart Cities, is presented. This specific application of the framework relies on OSS (Open Source Software) indicators to help make decisions on the most relevant data to publish, focused on developers and businesses operating within the Smart City context. This work was funded by (i) the Ministerio de Economía e Innovación (Spain) TIN2015-69957-R (MINECO/ERDF, EU) and TIN2016-78103-C2-2-R (MINECO/ERDF, EU) projects, (ii) the POCTEP 4IE project (0045-4IE-4-P), and (iii) the Consejería de Economía e Infraestructuras/Junta de Extremadura (Spain) - European Regional Development Fund (ERDF) GR18112 and IB16055 projects.
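
    A minimal sketch of the prioritization idea: score candidate datasets with a weighted sum of normalised demand indicators and publish the highest-scoring ones first. The indicators, weights, and datasets below are hypothetical; the paper's framework and its OSS indicators are richer than this.

        # Hypothetical demand indicators per candidate dataset.
        CANDIDATES = {
            "public-transport-gtfs": {"related_repos": 42, "api_requests": 9000, "citizen_votes": 120},
            "air-quality-sensors":   {"related_repos": 15, "api_requests": 4000, "citizen_votes": 300},
            "municipal-budget":      {"related_repos": 3,  "api_requests": 500,  "citizen_votes": 80},
        }
        WEIGHTS = {"related_repos": 0.5, "api_requests": 0.3, "citizen_votes": 0.2}

        def normalise(values):
            # Scale one indicator to [0, 1] across all candidate datasets.
            top = max(values.values()) or 1
            return {name: v / top for name, v in values.items()}

        def priority_scores(candidates, weights):
            # Normalise each indicator, then take the weighted sum per dataset.
            per_indicator = {
                ind: normalise({name: feats[ind] for name, feats in candidates.items()})
                for ind in weights
            }
            return {name: sum(weights[ind] * per_indicator[ind][name] for ind in weights)
                    for name in candidates}

        for name, score in sorted(priority_scores(CANDIDATES, WEIGHTS).items(),
                                  key=lambda kv: kv[1], reverse=True):
            print(f"{score:.2f}  {name}")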

    Modelling ETL processes of data warehouses with UML activity diagrams

    Extraction-transformation-loading (ETL) processes play an important role in a data warehouse (DW) architecture because they are responsible for integrating data from heterogeneous data sources into the DW repository. Importantly, most of the budget of a DW project is spent on designing these processes, since they are not taken into account in the early phases of the project but only once the repository has been deployed. In order to overcome this situation, we propose using the Unified Modelling Language (UML) to conceptually model the sequence of activities involved in ETL processes from the beginning of the project by using activity diagrams (ADs). Our approach provides designers with easy-to-use modelling elements to capture the dynamic aspects of ETL processes.

    A BPMN-Based Design and Maintenance Framework for ETL Processes

    Business Intelligence (BI) applications require the design, implementation, and maintenance of processes that extract, transform, and load suitable data for analysis. The development of these processes (known as ETL) is an inherently complex problem that is typically costly and time-consuming. In a previous work, we proposed a vendor-independent language for reducing the design complexity caused by disparate ETL languages tailored to specific design tools with steep learning curves. Nevertheless, the designer still faces two major issues during the development of ETL processes: (i) how to implement the designed processes in an executable language, and (ii) how to maintain the implementation when the organization's data infrastructure evolves. In this paper, we propose a model-driven framework that provides automatic code generation capabilities and improves the maintenance support of our ETL language. We present a set of model-to-text transformations able to produce code for different commercial ETL tools, as well as model-to-model transformations that automatically update the ETL models with the aim of supporting the maintenance of the generated code as the data sources evolve. A demonstration using an example is conducted as an initial validation to show that the framework, covering modeling, code generation and maintenance, could be used in practice.
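
    As an illustration of the model-to-text idea, the following Python sketch generates executable SQL from a small declarative ETL model. The model schema, field names, and SQL target are assumptions made for the example; the paper's framework operates on BPMN-based models and model-driven transformations rather than this simplified dictionary.

        # Toy ETL model: one flow from a staging table to a fact table
        # (hypothetical tables and mappings, for illustration only).
        ETL_MODEL = {
            "source_table": "staging_sales",
            "target_table": "dw_sales_fact",
            "mappings": [
                {"target": "sale_id",    "expr": "id"},
                {"target": "amount_eur", "expr": "amount * exchange_rate"},
                {"target": "sale_date",  "expr": "CAST(sold_at AS DATE)"},
            ],
            "filter": "amount IS NOT NULL",
        }

        def model_to_sql(model):
            # Model-to-text transformation: emit an INSERT ... SELECT statement.
            cols = ", ".join(m["target"] for m in model["mappings"])
            exprs = ",\n       ".join(f'{m["expr"]} AS {m["target"]}'
                                      for m in model["mappings"])
            return (f"INSERT INTO {model['target_table']} ({cols})\n"
                    f"SELECT {exprs}\n"
                    f"FROM {model['source_table']}\n"
                    f"WHERE {model['filter']};")

        print(model_to_sql(ETL_MODEL))

    When a data source evolves (for example, a column is renamed), a model-to-model transformation would rewrite the affected expr entries and the code could simply be regenerated, which is the maintenance scenario the framework targets.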